The NCI-60 cancer cell lines were developed as a drug screening tool focusing on a range of cancer types. In this vignette, we compare two major groups of NCI-60 cancer cell lines. The first is the leukemia subgroup consising of 6 cell lines (CCRF-CEM, HL-60 (TB), K-562, MOLT-4, RPMI-8226, SR). The second group is the breast/prostate/ovarian cell lines consisting of 14 total cell lines (BT-549, DU-145, HS 578T, IGROV1, MCF7, MDA-MB-231/ATCC, NCI/ADR-RES, OVCAR-3, OVCAR-4, OVCAR-5, OVCAR-8, PC-3, SK-OV-3, T-47D). The latter cancers are grouped together as they contain common susceptibility loci. This vigenette will highlight the analysis we conduct on the NCI-60 cancer cell lines.
IntLIM, available from Github, can be installed as in the documentation. Once IntLIM is installed, what is necessary is loading in the library.
rm(list = ls())
library(IntLim)
For the NCI-60 cell lines, metabolomics and gene expression data were downloadable from the DTP website (https://wiki.nci.nih.gov/display/ncidtpdata/molecular+target+data). The Metabolon data consisting of 353 metabolites and 58 cell lines with 177 technical replicates total was filtered for metabolites that had a median coefficient of variation of below 0.3. The coefficient value was arbitrarily selected to filter out technical replicates having high variability. The resulting metabolite abundance data set of 280 metabolites was subsequently log2 transformed. Probes from the Chiron Affymetrix U133 were mapped to genes using the Ensembl database hgu133.plus.db (Bioconductor package: hgu133plus.db). Probes mapping to more than one gene were removed. In cases where more than one probe was matching to a given gene, only the probe with the highest mean expression was used for analysis. This resulted in a total of 17,987 genes.
This data has been formatted for IntLIM. We load in the data as follows. The nci60.csv meta file contains a list of phenotypic data file, metabolite data file, gene expression data file, metabolite meta file, and gene expression meta file. The function OutputStats will give a summary of the NCI-60 data
inputData <- IntLim::ReadData('nci60.csv',metabid='id',geneid='id')
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
## [1] "CreateMultiDataSet created"
IntLim::OutputStats(inputData)
## Num_Genes Num_Metabolites Num_Samples_withExpression
## 1 17987 280 57
## Num_Samples_withExpression.1 Num_Samples_inCommon
## 1 58 57
From the OutputStats, we find that we have gene expression data involving 17,987 genes with 57 cell line sample and metabolite abundance data involving 280 metabolites with 58 cell lines.
We remove genes with mean belows below the 10th percentile. Furthermore, we remove metabolites with more than 80% missing values. This results in gene expression data involving 16188 genes and 57 cell lines and metabolite abundance data involving 220 metabolites and 58 cell lines.
inputDatafilt <- IntLim::FilterData(inputData,geneperc=0.10, metabmiss = 0.80)
## [1] "No metabolite filtering by percentile is applied"
IntLim::OutputStats(inputDatafilt)
## Num_Genes Num_Metabolites Num_Samples_withExpression
## 1 16188 220 57
## Num_Samples_withExpression.1 Num_Samples_inCommon
## 1 58 57
We can obtain boxplot distributions of the data as follows. This is used to make figures.
IntLim::PlotDistributions(inputDatafilt)
The principal component analysis is performed on filtered metabolite and gene expression data to obtain visual representations showing how different sub-sections of the data could be grouped into different clusters. Common samples belonging to either the Leukemia or Breast/Prostate/Ovarian groups are shown. Blue samples indicate leukemia cell lines and red samples indicate BPO. Note the clear delineation of samples.
PlotPCA(inputDatafilt, stype = "cancergroup", common = T)
The linear model is for integrating transcriptomic and metabolomics data is: E(m|g,t) = β1 + β2 g + β3 p + β4 (g:p) + ε where ‘m’ and ‘g’ are log-transformed metabolite abundances and gene levels respectively, ‘p’ is phenotype (cancer type, patient diagnosis, treatment group, etc), ‘(g:p)’ is the association between gene expression and phenotype, and ‘ε’ is the error term that is normally distributed. A statistically significant p value of the ‘(g:p)’ association term indicates that the slope relating gene expression and metabolite abundance is different from one phenotype compared to another. We run a linear model on common cell lines of the leukemia (n = 6) and BPO (n = 14) that included 16188 genes and 220 metabolites (total of 3,561,360 possible associations and hence models). For genes and metabolites that had standard deviations of 0 in one of the groups, we assign a p value of NA. The model is run as below by calling RunIntLim(). DistPvalues() allows us to obtain a distribution of p-values for the (g:p) term.
myres <- IntLim::RunIntLim(inputDatafilt, stype="cancergroup")
## [1] "Running the analysis on"
##
## BPO Leukemia
## 14 6
## [1] "10 % complete"
## [1] "20 % complete"
## [1] "30 % complete"
## [1] "40 % complete"
## [1] "50 % complete"
## [1] "60 % complete"
## [1] "70 % complete"
## [1] "80 % complete"
## [1] "90 % complete"
## user system elapsed
## 124.207 1.284 125.566
IntLim::DistPvalues(myres)
The next step is to process the results of this model by filtering the results of the linear model by FDR-adjusted p-value cutoff (0.10 selected here) for the (g:p) association coefficient and calculate the correlations of the gene-metabolite pairs in each group (BPO and Leukemia) for the filtered results. We further may only interested in results that have an absolute correlation value difference (0.50 selected here). This is done with the ProcessResults function. In addition we also develop a heatmap of the gene-metabolite association correlations for the selected groups.
myres <- IntLim::ProcessResults(myres, inputDatafilt, diffcorr = 0.5, pvalcutoff = 0.10)
IntLim::CorrHeatmap(myres)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
dim(myres@corr)
## [1] 1009 4
From this model we find 1009 gene-metabolite correlations that have an association FDR-adjusted p-value of 0.10 and an absolute value correlation difference of 0.5 or greater. The top pairs are shown below.
corr.table <- myres@corr
abs.corrdiff <- abs(myres@corr$BPO - myres@corr$Leukemia)
sort.table <- corr.table[order(-abs.corrdiff),]
sort.table[1:10,]
## metab gene BPO Leukemia
## 42 malic acid FAM174B 0.8285714 -1.0000000
## 637 X-1597 DLG4 0.8373626 -0.9276337
## 636 X-1497 STK4 0.7973588 -0.9428571
## 623 L-beta-imidazolelactic acid DNER -0.7909295 0.9411239
## 317 X-4516 DIS3L2 0.7318681 -1.0000000
## 625 phenylalanine SUZ12 0.7846154 -0.9428571
## 246 X-3094 DND1 0.7670330 -0.9428571
## 39 leucine DLG4 0.7802198 -0.9276337
## 942 X-2757 POMZP3 -0.7626374 0.9411239
## 1 (p-Hydroxyphenyl)lactic acid TYMS 0.8689425 -0.8285714
We can show some example plots of some of these pairs. The first example is the FAM174B vs. malic acid.
IntLim::PlotGMPair(inputDatafilt, stype = "cancergroup", geneName = "FAM174B", metabName = "malic acid")
Another example is DLG4 vs. leucine.
IntLim::PlotGMPair(inputDatafilt, stype = "cancergroup", geneName = "DLG4", metabName = "leucine")